Simple linear regression

Predict the price of a pizza from its diameter

The sklearn.linear_model.LinearRegression class is an estimator: it predicts a value of the response variable from observed data. In scikit-learn, all estimators implement the fit() and predict() methods: fit() learns the parameters of a model from training data, and predict() uses the learned parameters to predict the response for a new value of the explanatory variable. Because every estimator exposes the same interface, it is easy to experiment with different models.
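
As a minimal sketch of that uniform interface (hypothetical names X_demo and y_demo; DecisionTreeRegressor is used only as a second example, it is not part of this notebook's analysis):

In [ ]:
#Any scikit-learn regressor exposes the same fit/predict interface
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_demo = [[6], [8], [10], [14], [18]]  # pizza diameters in inches
y_demo = [7, 9, 13, 17.5, 18]          # prices in dollars

for est in (LinearRegression(), DecisionTreeRegressor()):
    est.fit(X_demo, y_demo)                              # learn the parameters
    print(est.__class__.__name__, est.predict([[12]]))   # predict a 12" pizza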


In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
X = [[6], [8], [10], [14],   [18]]
Y = [[7], [9], [13], [17.5], [18]]

In [4]:
plt.figure()
plt.title("Pizza price plotted against diameter")
plt.xlabel("Diameter in inches")
plt.ylabel("Price in dollars")
plt.plot(X,Y,"k.")
plt.axis([0,25,0,25])
plt.grid(True)
plt.show()



In [23]:
#Fit a simple linear regression
from sklearn.linear_model import LinearRegression

#Create and fit the model
model = LinearRegression()
model.fit(X,Y)

#Make prediction (predict expects a 2D array: one row per sample)
print('A 12" pizza should cost: %0.2f' % model.predict([[12]])[0])


A 12" pizza should cost: 13.68

Evaluate the fitness of the model using a cost (loss) function

The cost function for simple linear regression is the residual sum of squares (RSS), the sum of the squared differences between the observed and predicted prices. Dividing RSS by the number of samples gives the mean squared error (MSE), which is reported below.


In [29]:
import numpy as np

print("Residual sum of squares: %.2f" %np.mean((model.predict(X)-Y)**2))


Mean squared error: 1.75

Solving OLS for simple linear regression

The goal is to calculate the coefficients that minimize the cost function. For simple linear regression these have a closed form: the slope is beta = cov(X, Y) / var(X) and the intercept is alpha = mean(Y) - beta * mean(X).


In [51]:
from __future__ import division

xbar = np.mean(X)
ybar = np.mean(Y)
print("Mean of X is:", xbar)
print("Mean of Y is:", ybar)


Mean of X is: 11.2
Mean of Y is: 12.9

In [55]:
#Define my own variance and covariance functions to better understand how they work
def variance(X):
    return np.sum((X - np.mean(X))**2) / (len(X) - 1)

def covariance(X, Y):
    return np.sum((X - np.mean(X)) * (Y - np.mean(Y))) / (len(X) - 1)

In [56]:
print("Variance of X: ", variance(X))
print("Covariance of X, Y is: ", covariance(X,Y))


Variance of X:  23.2
Covariance of X, Y is:  22.65

In [57]:
#For simple linear regression, the slope is beta = cov(X, Y) / var(X).
#With beta known, the intercept is alpha = ybar - beta * xbar.

beta = covariance(X,Y) / variance(X)
beta


Out[57]:
0.9762931034482758
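
With beta in hand, a short follow-up cell (not in the original notebook) computes the intercept alpha = ybar - beta * xbar and reproduces the 12" prediction from the fitted model above:

In [ ]:
#Compute the intercept from the values already calculated above
alpha = ybar - beta * xbar
print("alpha (intercept):", alpha)
#Price of a 12" pizza using the manually derived coefficients
print('A 12" pizza should cost:', alpha + beta * 12)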

Evaluate fit using R-squared

R-squared measures the proportion of variance in the response explained by the model: R^2 = 1 - RSS/TSS, where TSS is the total sum of squares of the response around its mean.


In [70]:
#Load a separate set of pizza diameters and prices for testing
X_test = np.array([8,9,11,16,12])
Y_test = np.array([11,8.5,15,18,11])

In [92]:
#Note: this cell refits the model on the test points and then predicts those same points
model.fit(X_test.reshape(-1, 1), Y_test)
model.predict(X_test.reshape(-1, 1))


Out[92]:
array([  9.54123711,  10.52835052,  12.50257732,  17.43814433,  13.48969072])

In [96]:
def total_sum_squares(y):
    return np.sum((y - np.mean(y))**2)

In [97]:
#Residual sum of squares of the predictions
def residual_sum_squares(y, y_pred):
    return np.sum((y - y_pred)**2)

In [98]:
#Compute R-squared as 1 - RSS/TSS

1 - residual_sum_squares(Y_test, model.predict(X_test.reshape(-1, 1))) / total_sum_squares(Y_test)


Out[98]:
0.66560639610861039

In [100]:
#The same R-squared from scikit-learn's score method
model.score(X_test.reshape(-1,1),Y_test)


Out[100]:
0.6656063961086105
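
The same statistic is also available from sklearn.metrics; a small aside (not in the original notebook) reusing the fitted model and test arrays from above:

In [ ]:
#R-squared computed with sklearn.metrics, for comparison
from sklearn.metrics import r2_score
r2_score(Y_test, model.predict(X_test.reshape(-1, 1)))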

Multiple linear regression

Add the number of toppings as a second explanatory variable to the pizza-price model. With a design matrix X whose first column is all ones (for the intercept), the ordinary least squares coefficients are given by the normal equation: beta = (X^T X)^{-1} X^T Y.


In [101]:
from numpy.linalg import inv
from numpy import dot, transpose

In [103]:
X = [[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]]
X


Out[103]:
[[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]]

In [104]:
y = [[7], [9], [13], [17.5], [18]]

In [107]:
#Solve the normal equation directly with linear algebra
dot(inv(dot(transpose(X),X)), dot(transpose(X),y))


Out[107]:
array([[ 1.1875    ],
       [ 1.01041667],
       [ 0.39583333]])
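
As an aside (not in the original notebook): explicitly inverting X^T X is numerically less stable than solving the corresponding linear system, which yields the same coefficients here:

In [ ]:
#Solve the normal equation as a linear system instead of inverting X^T X
from numpy.linalg import solve
solve(dot(transpose(X), X), dot(transpose(X), y))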

In [109]:
#Solve using numpy least squares procedure
from numpy.linalg import lstsq
lstsq(X,y)[0]


Out[109]:
array([[ 1.1875    ],
       [ 1.01041667],
       [ 0.39583333]])

In [112]:
#Compare simple vs multiple linear regression (scikit-learn fits the intercept, so no column of ones is needed)
X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
Y = [[7],    [9],    [13],    [17.5],  [18]]

In [114]:
model = LinearRegression()
model.fit(X,Y)


Out[114]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [115]:
X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
Y_test = [[11],   [8.5],  [15],    [18],    [11]]

In [118]:
predictions = model.predict(X_test)

for i, prediction in enumerate(predictions):
    print("Predicted: %s, Target: %s" %(prediction, Y_test[i]))


Predicted: [ 10.0625], Target: [11]
Predicted: [ 10.28125], Target: [8.5]
Predicted: [ 13.09375], Target: [15]
Predicted: [ 18.14583333], Target: [18]
Predicted: [ 13.3125], Target: [11]

In [119]:
print("R square:", model.score(X_test, Y_test))


R square: 0.770167773132

Polynomial regression

Use PolynomialFeatures to transform the data: a degree-2 transform maps each diameter x to the feature vector [1, x, x^2], so a linear model fit on the transformed features is quadratic in the diameter.


In [120]:
from sklearn.preprocessing import PolynomialFeatures

In [122]:
X_train = [[6], [8], [10], [14],   [18]]
y_train = [[7], [9], [13], [17.5], [18]]
X_test = [[6],  [8],   [11], [16]]
y_test = [[8], [12], [15], [18]]

In [123]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)


Out[123]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [125]:
xx = np.linspace(0, 26, 100)
xx


Out[125]:
array([  0.        ,   0.26262626,   0.52525253,   0.78787879,
         1.05050505,   1.31313131,   1.57575758,   1.83838384,
         2.1010101 ,   2.36363636,   2.62626263,   2.88888889,
         3.15151515,   3.41414141,   3.67676768,   3.93939394,
         4.2020202 ,   4.46464646,   4.72727273,   4.98989899,
         5.25252525,   5.51515152,   5.77777778,   6.04040404,
         6.3030303 ,   6.56565657,   6.82828283,   7.09090909,
         7.35353535,   7.61616162,   7.87878788,   8.14141414,
         8.4040404 ,   8.66666667,   8.92929293,   9.19191919,
         9.45454545,   9.71717172,   9.97979798,  10.24242424,
        10.50505051,  10.76767677,  11.03030303,  11.29292929,
        11.55555556,  11.81818182,  12.08080808,  12.34343434,
        12.60606061,  12.86868687,  13.13131313,  13.39393939,
        13.65656566,  13.91919192,  14.18181818,  14.44444444,
        14.70707071,  14.96969697,  15.23232323,  15.49494949,
        15.75757576,  16.02020202,  16.28282828,  16.54545455,
        16.80808081,  17.07070707,  17.33333333,  17.5959596 ,
        17.85858586,  18.12121212,  18.38383838,  18.64646465,
        18.90909091,  19.17171717,  19.43434343,  19.6969697 ,
        19.95959596,  20.22222222,  20.48484848,  20.74747475,
        21.01010101,  21.27272727,  21.53535354,  21.7979798 ,
        22.06060606,  22.32323232,  22.58585859,  22.84848485,
        23.11111111,  23.37373737,  23.63636364,  23.8989899 ,
        24.16161616,  24.42424242,  24.68686869,  24.94949495,
        25.21212121,  25.47474747,  25.73737374,  26.        ])

In [127]:
yy = regressor.predict(xx.reshape(xx.shape[0],1))
plt.plot(xx,yy)


Out[127]:
[<matplotlib.lines.Line2D at 0x10c7099e8>]

In [129]:
quadratic_featurizer = PolynomialFeatures(degree=2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_train_quadratic


Out[129]:
array([[   1.,    6.,   36.],
       [   1.,    8.,   64.],
       [   1.,   10.,  100.],
       [   1.,   14.,  196.],
       [   1.,   18.,  324.]])

In [130]:
X_test_quadratic = quadratic_featurizer.transform(X_test)

In [131]:
regressor_quadratic = LinearRegression()

In [132]:
regressor_quadratic.fit(X_train_quadratic,y_train)


Out[132]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [133]:
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))

In [136]:
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), c='r',linestyle = '--')
plt.title("Pizza price regressed on diameter")
plt.xlabel("Diameter in inches")
plt.ylabel("Price in dollars")
plt.axis([0,25,0,25])
plt.grid(True)
plt.scatter(X_train, y_train)
plt.show()
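
To quantify what the plot suggests, a short follow-up (not in the original notebook) scores both regressors on the held-out points defined earlier in this section:

In [ ]:
#Compare the simple and quadratic fits on the test set
print("Simple linear regression R squared:", regressor.score(X_test, y_test))
print("Quadratic regression R squared:", regressor_quadratic.score(X_test_quadratic, y_test))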


Apply linear regression to the red wine quality dataset from the UCI Machine Learning Repository


In [139]:
import pandas as pd

In [155]:
target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")
df = pd.read_csv(target_url,header=0, sep=";")

In [156]:
df.head()


Out[156]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

In [157]:
df.describe()


Out[157]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

In [158]:
plt.scatter(df['alcohol'], df['quality'])
plt.xlabel("Alcohol")
plt.ylabel("Quality")
plt.title("Alcohol against Quality")
plt.show()
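
Before fitting, it can help to see how each variable correlates with quality; a quick aside (not in the original notebook), assuming a pandas version that provides sort_values:

In [ ]:
#Correlation of each explanatory variable with wine quality
df.corr()['quality'].sort_values(ascending=False)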


Fit and evaluate the model


In [159]:
from sklearn.cross_validation import train_test_split

In [160]:
#Separate the explanatory variables from the target; the train/test split comes later
X = df[list(df.columns)[:-1]]
y = df['quality']

In [162]:
X.head()


Out[162]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4

In [163]:
y.tail()


Out[163]:
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, dtype: int64

In [168]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [169]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_predictions = regressor.predict(X_test)

In [170]:
#Check R squared
print("R squared is: ", regressor.score(X_test, y_test))


R squared is:  0.30726198487

Cross validation

With cv=5, cross_val_score partitions the data into five folds; each fold in turn is held out for scoring while the regressor is trained on the remaining four, giving five R-squared estimates.


In [171]:
#Make cross validation
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(regressor, X, y, cv = 5)

In [172]:
print(scores.mean(), scores)


0.290041628842 [ 0.13200871  0.31858135  0.34955348  0.369145    0.2809196 ]

Fitting using gradient descent

SGDRegressor estimates the model parameters with stochastic gradient descent. Because gradient descent is sensitive to the scale of the features, the explanatory variables and the target are standardized before fitting.


In [193]:
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

data = load_boston()

In [194]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

In [195]:
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()



In [196]:
regressor = SGDRegressor(loss='squared_loss')

In [197]:
scores = cross_val_score(regressor, X_train, y_train, cv=5)
print('Cross validation r-squared scores:', scores)
print('Average cross validation r-squared score:', np.mean(scores))
regressor.fit(X_train, y_train)
print('Test set r-squared score', regressor.score(X_test, y_test))


Cross validation r-squared scores: [ 0.82610369  0.73023677  0.72314011  0.57012925  0.63665398]
Average cross validation r-squared score: 0.697252759707
Test set r-squared score 0.702989500158
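
Because the target was standardized, the predictions above are in standard-deviation units; a short follow-up sketch (not in the original notebook) maps a few of them back to the original scale with the fitted y_scaler:

In [ ]:
#Transform a few predictions back to the original units of the Boston housing target
predictions = regressor.predict(X_test)
print(y_scaler.inverse_transform(predictions.reshape(-1, 1))[:5].ravel())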

In [ ]: